The Study Case and Data:

In our case study, we are employees of the fictitious company “217”, which produces sheet metal parts for vehicle bodies of German car manufacturers. Recently, there have been more and more customer complaints about premature rust spots on the company products. Intensive research revealed that this may be caused by the purchase of steel whose composition did not comply with the company’s specifications. An internal investigation revealed that this steel was processed in the company plants in the period from June 1, 2012 to July 31, 2014. Our goal is to analyze this situation with the production data and parts lists of our customers and to identify the customers and vehicles affected.Therefore we aim to identify critical production quantities, isolate them and inform customers and car manufacturers who have installed critical components in their vehicles from the available data.

We possess the information that body parts produced by our company are installed in “Type11”, “Type12”, “Type21” and “Type22” vehicles of “OEM1” and “OEM2” car brands.

Lastly, for reference, we also have the following translation which will be helpful for understanding the data:

  • Single Part = Einzelteil
  • Component = Komponente
  • Vehicle = Fahrzeug
  • Geodata = Geodaten
  • Registrations = Zulassungen
  • Logistics delay = Logistikverzug

1. Setting:

1.1 Import and loading of necessary packages for the analysis:

if(!require("install.load")){
   install.packages("install.load")
}
library('install.load')
install_load("readr", "readxl", "tidyr", "dplyr", "stringr", "tidyverse", "knitr", "rmarkdown", "ggplot2","lubridate","kableExtra")

2. Importing the data:

2.1 Creation of a function to import poorly formatted flat files:

First, we will construct a function that will enable us to read messy flat files (.txt)

read_txt <- function(path, replacement){
  
  path %>% # Take the file from the path
    read_file() %>% # Import it as a string
    str_replace_all(replacement) %>% # Apply a specific replacement for each dataset
    str_replace_all(c("\n\"\\d+\"," = "\n", "\nNA," = "\n")) %>% # Apply general replacement
    read_delim(show_col_types = FALSE) # Read it as a dataframe
}

2.2 Identification of relevant Databases:

Recall the fact that we know our parts are installed in “Type11”, “Type12”, “Type21” and “Type22” vehicles of “OEM1” and “OEM2” car brands. Furthermore, we also know that the provided data are in the form relational databases which are combined through specific keys/IDs. Leading us to the following graph:

This information enabled us to identify the relevant datasets for our analysis, which are:

  • Geodaten_Gemeinden_v1.2_2017-08-22_TrR: Geodata with latitude and longitude data of corresponding community and potcodes.
    • Geodaten_Gemeinden_v1.2_2017-08-22_TrR
  • Zulassungen_alle_Fahrzeuge: Registration data of vehicles
    • Zulassungen_alle_Fahrzeuge
  • Fahrzeuge_ID: Data of produced cars
    • Fahrzeuge_OEM1_Typ11
    • Fahrzeuge_OEM1_Typ12
    • Fahrzeuge_OEM1_Typ21
    • Fahrzeuge_OEM1_Typ22
  • Bestandteile_Fahrzeuge: Relation table with the ID of components installed in the respective car types
    • Bestandteile_Fahrzeuge_OEM1_Typ11
    • Bestandteile_Fahrzeuge_OEM1_Typ12
    • Bestandteile_Fahrzeuge_OEM2_Typ21
    • Bestandteile_Fahrzeuge_OEM2_Typ22
  • Komponente: Data files of the components installed in the car body parts
    • Komponente_k4
    • Komponente_k5
    • Komponente_k6
    • Komponente_k7
  • Bestandteile_Komponente: Relational Databases, with the ID of parts installed in specific components
    • Bestandteile_Komponente_k4
    • Bestandteile_Komponente_k5
    • Bestandteile_Komponente_k6
    • Bestandteile_Komponente_k7
  • Einzelteile: Data of Parts produced by our company
    • Einzelteil_T30
    • Einzelteil_T31
    • Einzelteil_T32
    • Einzelteil_T33
    • Einzelteil_T34
    • Einzelteil_T35
    • Einzelteil_T37

2.3 Importing relevant DB:

After having identified the relevant databases for our analysis, it is time to import these files:

2.3.1 Importing of Geodata:

geodata <- read_csv2("Data/Geodaten/Geodaten_Gemeinden_v1.2_2017-08-22_TrR.csv")

2.3.2 Importing of Registrations:

registrations <- read_csv2("Data/Zulassungen/Zulassungen_alle_Fahrzeuge.csv")

2.3.3 Importing of cars data:

car_type11 <- read_csv("Data/Fahrzeug/Fahrzeuge_OEM1_Typ11.csv")


car_type12 <- read_csv2("Data/Fahrzeug/Fahrzeuge_OEM1_Typ12.csv")


car_type21 <- read_csv("Data/Fahrzeug/Fahrzeuge_OEM2_Typ21.csv")


car_type22 <- read_csv2("Data/Fahrzeug/Fahrzeuge_OEM2_Typ22.csv")

2.3.4 Importing of Car parts data,i.e., Bestandteile_Fahrzeuge Data:

car_parts_type11 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ11.csv")


car_parts_type12 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM1_Typ12.csv")


car_parts_type21 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ21.csv")


car_parts_type22 <-read_csv2("Data/Fahrzeug/Bestandteile_Fahrzeuge_OEM2_Typ22.csv")

2.3.5 Importing of components installed in the car body parts:

comp_4 <- read_csv2("Data/Komponente/Komponente_k4.csv")


comp_5 <- read_csv("Data/Komponente/Komponente_k5.csv")


comp_6 <- read_csv2("Data/Komponente/Komponente_k6.csv")


comp_7 <- read_txt("Data/Komponente/Komponente_K7.txt", 
                   replacement = c("\t" = ",", "\r" = "\n"))

2.3.6 Importing of Components Parts data,i.e, Bestandteile_Komponente:

comp_parts_4 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k4.csv")


comp_parts_5 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k5.csv")


comp_parts_6 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k6.csv")


comp_parts_7 <- read_csv2("Data/Komponente/Bestandteile_Komponente_k7.csv")

2.3.7 Finally the import of the relevant parts produced by our company:

P_30 <- read_csv("Data/Einzelteil/Einzelteil_T30.csv")

P_31 <- read_txt("Data/Einzelteil/Einzelteil_T31.txt", 
                 replacement = c("  " = ",", "\b" = "\n"))

P_32 <- read_csv2("Data/Einzelteil/Einzelteil_T32.csv")

P_33 <- read_csv("Data/Einzelteil/Einzelteil_T33.csv")

P_34 <- read_txt("Data/Einzelteil/Einzelteil_T34.txt", 
                 replacement = c(" \\| \\| " = ",", "\"\"" = "\"\n\""))

P_35 <- read_txt("Data/Einzelteil/Einzelteil_T35.txt", 
                replacement =c("\\\\" = ",", "\"\"" = "\"\n\"", "NA\"" = "NA\n\""))

P_37 <- read_csv("Data/Einzelteil/Einzelteil_T37.csv")

3. Data preparation:

In this section, we will proceed to tidy, clean and filter the datasets imported in the previous section. The goal is to have all datasets organized so that we can easily identify the relevant information and leave these data sets ready for the joining procedure.

3.1 Creation of auxiliary functions:

In this part we will create three functions:

  • Drop_Cols: Which will help us remove unnecessary recurrent columns from the datasets.
  • Prod_date: Which will fix inconsistencies regarding the production date column and erase unncessary columns related to the production.
  • Fix_unstruct_dat: Which will convert unstructred data sets into structured ones.
  • Filter_217_dates: Which will filter for company 217 and production dates between the 1st of June, 2012 and the 31st of July, 2014
#Function to drop unnecessary columns

Drop_Cols <- function(df){
  if('V1' %in% names(df)) {df <- df %>% dplyr::select(-V1)}
  if('X1' %in% names(df)) {df <- df %>% dplyr::select(-X1)}
  if('X' %in% names(df)) {df <- df %>% dplyr::select(-X)}
  if('...1' %in% names(df)) {df <- df %>% dplyr::select(-...1)}
  return(df)
}

#Function to transform date columns

Prod_date <- function(df){
  starting_date <- as.Date("1970-01-01")
  df %>%
    mutate(Produktionsdatum = starting_date + Produktionsdatum_Origin_01011970) %>%
    dplyr::select(-Produktionsdatum_Origin_01011970,-origin)
}


#Function to fix unstructured Data

Fix_unstruct_dat <- function(data){
  #If the dataset has 14 columns we must convert it to only 7 columns:
  if (ncol(data) == 14) {
    
    df1 <- data[,1:7]
    names(df1) <- sub("\\.x", "", names(df1))
    df2 <- data[,8:14]
    names(df2) <- sub("\\.y", "", names(df2))
    
    #combine dataframes
    df <- dplyr::bind_rows(df1, df2) %>% distinct() %>% drop_na(Werksnummer)
    return(df)
    
    #Same logic as in the second case when it has 21: 
  } else if(ncol(data) == 21) {
    
    df1 <- data[,1:7]
    names(df1) <- sub("\\.x", "", names(df1))
    df2 <- data[,8:14]
    names(df2) <- sub("\\.y", "", names(df2))
    df3 <- data[,15:21]
    
    # Combine Datasets
    df <- dplyr::bind_rows(df1, df2, df3) %>% distinct() %>% drop_na(Werksnummer)
    return(df)
  } else {
    print("Data cannot be processed with this function yet, prior changes are required!")
  }
}


#Function to filter for company 217 and production dates indicated on the task

Filter_217_dates <- function(df){
  df %>%
  filter(Herstellernummer == 217)%>%
  filter(Produktionsdatum >= "2012-06-01" & Produktionsdatum <= "2014-07-31")
}

3.2 Tidying, filtering and cleaning of the data:

We intend to tidy the datasets such that they all repect the following conditions:

Thus, a dataset should meet three key points to be tidy:

  • Each variable is mapped in a separate column
  • Each observation (to a statistic unit) is mapped in one line (row)
  • Each observation value is in a separate cell

3.2.1 Tidying of Geodata:

#Geodata
geodata <- Drop_Cols(geodata)
# The first 846 obs of the dataset have only 4 digits but we know that in Germany Postal codes
#have 5 digits, thus we must transform the Postleitzahl variable
geodata$Postleitzahl <- as.character(geodata$Postleitzahl)
geodata$Postleitzahl[nchar(geodata$Postleitzahl) == 4] <-paste("0", geodata$Postleitzahl[nchar(geodata$Postleitzahl) == 4], sep = "")
glimpse(geodata)
## Rows: 7,775
## Columns: 4
## $ Postleitzahl <chr> "01067", "01445", "01454", "01454", "01458", "01468", "01…
## $ Gemeinde     <chr> "DRESDEN", "RADEBEUL", "RADEBERG", "WACHAU", "OTTENDORF O…
## $ Laengengrad  <dbl> 13.73688, 13.65934, 13.92197, 13.90468, 13.83172, 13.6800…
## $ Breitengrad  <dbl> 51.05170, 51.10598, 51.11640, 51.16012, 51.18762, 51.1594…
kable(head(geodata)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(geodata), width_max = "12em", width_min = "8em")
Postleitzahl Gemeinde Laengengrad Breitengrad
01067 DRESDEN 13.73688 51.05170
01445 RADEBEUL 13.65934 51.10598
01454 RADEBERG 13.92197 51.11640
01454 WACHAU 13.90468 51.16012
01458 OTTENDORF OKRILLA 13.83172 51.18762
01468 MORITZBURG 13.68009 51.15949

3.2.2 Tidying of Registrations:

#Registrations
registrations <- Drop_Cols(registrations)
glimpse(registrations)
## Rows: 3,204,104
## Columns: 3
## $ IDNummer  <chr> "11-1-11-1", "11-1-11-2", "12-1-12-1", "12-1-12-2", "12-1-12…
## $ Gemeinden <chr> "DRESDEN", "DRESDEN", "LEIPZIG", "LEIPZIG", "DORTMUND", "DOR…
## $ Zulassung <date> 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01,…
kable(head(registrations)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(registrations), width_max = "12em", width_min = "8em")
IDNummer Gemeinden Zulassung
11-1-11-1 DRESDEN 2009-01-01
11-1-11-2 DRESDEN 2009-01-01
12-1-12-1 LEIPZIG 2009-01-01
12-1-12-2 LEIPZIG 2009-01-01
12-1-12-3 DORTMUND 2009-01-01
12-1-12-4 DORTMUND 2009-01-01

3.2.3 Tidying of cars data:

#Cars

#Selection and Reordering of important columns:
order_car <-c("ID_Fahrzeug","Herstellernummer","Werksnummer")


car_type11 <- Drop_Cols(car_type11) %>% dplyr::select(order_car) %>% relocate(order_car)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(order_car)` instead of `order_car` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
glimpse(car_type11)
## Rows: 1,977,164
## Columns: 3
## $ ID_Fahrzeug      <chr> "11-1-11-1", "11-1-11-2", "11-1-11-3", "11-1-11-4", "…
## $ Herstellernummer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Werksnummer      <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
kable(head(car_type11)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type11), width_max = "12em", width_min = "8em")
ID_Fahrzeug Herstellernummer Werksnummer
11-1-11-1 1 11
11-1-11-2 1 11
11-1-11-3 1 11
11-1-11-4 1 11
11-1-11-5 1 11
11-1-11-6 1 11
car_type12 <- Drop_Cols(car_type12) %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type12)
## Rows: 408,096
## Columns: 3
## $ ID_Fahrzeug      <chr> "12-1-12-1", "12-1-12-2", "12-1-12-3", "12-1-12-4", "…
## $ Herstellernummer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Werksnummer      <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
kable(head(car_type12)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type12), width_max = "12em", width_min = "8em")
ID_Fahrzeug Herstellernummer Werksnummer
12-1-12-1 1 12
12-1-12-2 1 12
12-1-12-3 1 12
12-1-12-4 1 12
12-1-12-5 1 12
12-1-12-6 1 12
car_type21 <- Drop_Cols(car_type21) %>% Prod_date() %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type21)
## Rows: 512,354
## Columns: 3
## $ ID_Fahrzeug      <chr> "21-2-21-1", "21-2-21-2", "21-2-21-3", "21-2-21-4", "…
## $ Herstellernummer <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ Werksnummer      <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 2…
kable(head(car_type21)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type21), width_max = "12em", width_min = "8em")
ID_Fahrzeug Herstellernummer Werksnummer
21-2-21-1 2 21
21-2-21-2 2 21
21-2-21-3 2 21
21-2-21-4 2 21
21-2-21-5 2 21
21-2-21-6 2 21
car_type22 <- Drop_Cols(car_type22) %>% Prod_date() %>% dplyr::select(order_car) %>% relocate(order_car)
glimpse(car_type22)
## Rows: 306,490
## Columns: 3
## $ ID_Fahrzeug      <chr> "22-2-21-1", "22-2-21-2", "22-2-21-3", "22-2-21-4", "…
## $ Herstellernummer <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ Werksnummer      <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 2…
kable(head(car_type22)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_type22), width_max = "12em", width_min = "8em")
ID_Fahrzeug Herstellernummer Werksnummer
22-2-21-1 2 21
22-2-21-2 2 21
22-2-21-3 2 21
22-2-21-4 2 21
22-2-21-5 2 21
22-2-21-6 2 21
rm(order_car)

3.2.4 Tidying of cars Car parts data,i.e., Bestandteile_Fahrzeuge Data:

# Since we only care about the body parts, we only select the columns ID_Karosserie and ID_Fahrzeug

car_parts_type11 <- Drop_Cols(car_parts_type11) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type11)
## Rows: 1,977,164
## Columns: 2
## $ ID_Karosserie <chr> "K4-112-1121-3", "K4-112-1121-4", "K4-112-1121-7", "K4-1…
## $ ID_Fahrzeug   <chr> "11-1-11-1", "11-1-11-2", "11-1-11-3", "11-1-11-4", "11-…
kable(head(car_parts_type11)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type11), width_max = "12em", width_min = "8em")
ID_Karosserie ID_Fahrzeug
K4-112-1121-3 11-1-11-1
K4-112-1121-4 11-1-11-2
K4-112-1121-7 11-1-11-3
K4-112-1121-9 11-1-11-4
K4-112-1121-11 11-1-11-5
K4-112-1121-25 11-1-11-6
car_parts_type12 <- Drop_Cols(car_parts_type12) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type12)
## Rows: 408,096
## Columns: 2
## $ ID_Karosserie <chr> "K5-112-1122-1", "K5-112-1122-11", "K5-112-1122-2", "K5-…
## $ ID_Fahrzeug   <chr> "12-1-12-1", "12-1-12-2", "12-1-12-3", "12-1-12-4", "12-…
kable(head(car_parts_type12)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type12), width_max = "12em", width_min = "8em")
ID_Karosserie ID_Fahrzeug
K5-112-1122-1 12-1-12-1
K5-112-1122-11 12-1-12-2
K5-112-1122-2 12-1-12-3
K5-112-1122-3 12-1-12-4
K5-112-1122-4 12-1-12-5
K5-112-1122-6 12-1-12-6
car_parts_type21 <- Drop_Cols(car_parts_type21) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type21)
## Rows: 512,354
## Columns: 2
## $ ID_Karosserie <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-8", "K6-1…
## $ ID_Fahrzeug   <chr> "21-2-21-1", "21-2-21-2", "21-2-21-4", "21-2-21-5", "21-…
kable(head(car_parts_type21)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type21), width_max = "12em", width_min = "8em")
ID_Karosserie ID_Fahrzeug
K6-113-1132-1 21-2-21-1
K6-113-1132-2 21-2-21-2
K6-113-1132-8 21-2-21-4
K6-113-1132-9 21-2-21-5
K6-113-1132-11 21-2-21-6
K6-113-1132-12 21-2-21-7
car_parts_type22 <- Drop_Cols(car_parts_type22) %>% dplyr::select(ID_Karosserie,ID_Fahrzeug)
glimpse(car_parts_type22)
## Rows: 306,490
## Columns: 2
## $ ID_Karosserie <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-5", "K7-1…
## $ ID_Fahrzeug   <chr> "22-2-21-1", "22-2-21-2", "22-2-21-3", "22-2-21-4", "22-…
kable(head(car_parts_type22)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(car_parts_type22), width_max = "12em", width_min = "8em")
ID_Karosserie ID_Fahrzeug
K7-114-1142-1 22-2-21-1
K7-114-1142-2 22-2-21-2
K7-114-1142-5 22-2-21-3
K7-114-1142-8 22-2-21-4
K7-114-1142-11 22-2-21-5
K7-114-1142-13 22-2-21-6

3.2.5 Tidying of Components:

# From the components data, we believe that only the columns ID_Karosserie will be relevant for our analysis and therefore we only select those variables

comp_4 <- Drop_Cols(comp_4) %>% Fix_unstruct_dat()%>% dplyr::select(ID_Karosserie)
glimpse(comp_4)
## Rows: 1,977,164
## Columns: 1
## $ ID_Karosserie <chr> "K4-112-1121-3", "K4-112-1121-4", "K4-112-1121-7", "K4-1…
kable(head(comp_4)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_4), width_max = "12em", width_min = "8em")
ID_Karosserie
K4-112-1121-3
K4-112-1121-4
K4-112-1121-7
K4-112-1121-9
K4-112-1121-11
K4-112-1121-25
comp_5 <- Drop_Cols(comp_5) %>% Fix_unstruct_dat() %>% dplyr::select(ID_Karosserie)
glimpse(comp_5)
## Rows: 408,096
## Columns: 1
## $ ID_Karosserie <chr> "K5-112-1122-1", "K5-112-1122-11", "K5-112-1122-2", "K5-…
kable(head(comp_5)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_5), width_max = "12em", width_min = "8em")
ID_Karosserie
K5-112-1122-1
K5-112-1122-11
K5-112-1122-2
K5-112-1122-3
K5-112-1122-4
K5-112-1122-6
comp_6 <- Drop_Cols(comp_6) %>% Prod_date() %>% dplyr::select(ID_Karosserie)
glimpse(comp_6)
## Rows: 512,354
## Columns: 1
## $ ID_Karosserie <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-3", "K6-1…
kable(head(comp_6)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_6), width_max = "12em", width_min = "8em")
ID_Karosserie
K6-113-1132-1
K6-113-1132-2
K6-113-1132-3
K6-113-1132-8
K6-113-1132-9
K6-113-1132-11
comp_7 <- Drop_Cols(comp_7) %>% Prod_date() %>% dplyr::select(ID_Karosserie)
glimpse(comp_7)
## Rows: 306,490
## Columns: 1
## $ ID_Karosserie <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-5", "K7-1…
kable(head(comp_7)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_7), width_max = "12em", width_min = "8em")
ID_Karosserie
K7-114-1142-1
K7-114-1142-2
K7-114-1142-5
K7-114-1142-8
K7-114-1142-11
K7-114-1142-13

3.2.6 Tidying of Component parts data,i.e, Bestandteile_Komponente:

comp_parts_4 <- Drop_Cols(comp_parts_4)
glimpse(comp_parts_4)
## Rows: 1,977,164
## Columns: 4
## $ ID_T30 <chr> "30-216-2161-242", "30-216-2161-87", "30-217-2171-214", "30-216…
## $ ID_T31 <chr> "31-217-2172-302", "31-216-2161-242", "31-217-2172-53", "31-217…
## $ ID_T32 <chr> "32-216-2162-21", "32-216-2162-48", "32-216-2162-62", "32-216-2…
## $ ID_K4  <chr> "K4-112-1121-1", "K4-112-1121-2", "K4-112-1121-3", "K4-112-1121…
kable(head(comp_parts_4)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_4), width_max = "12em", width_min = "8em")
ID_T30 ID_T31 ID_T32 ID_K4
30-216-2161-242 31-217-2172-302 32-216-2162-21 K4-112-1121-1
30-216-2161-87 31-216-2161-242 32-216-2162-48 K4-112-1121-2
30-217-2171-214 31-217-2172-53 32-216-2162-62 K4-112-1121-3
30-216-2161-144 31-217-2172-139 32-216-2162-134 K4-112-1121-4
30-216-2161-71 31-216-2161-260 32-216-2162-153 K4-112-1121-5
30-216-2161-214 31-216-2161-328 32-216-2162-165 K4-112-1121-6
comp_parts_5 <- Drop_Cols(comp_parts_5)
glimpse(comp_parts_5)
## Rows: 408,096
## Columns: 4
## $ ID_T30 <chr> "30-217-2171-67", "30-216-2162-3", "30-216-2162-10", "30-216-21…
## $ ID_T31 <chr> "31-216-2161-15", "31-216-2161-67", "31-216-2161-125", "31-216-…
## $ ID_T33 <chr> "33-217-2172-32", "33-217-2172-82", "33-216-2162-8", "33-216-21…
## $ ID_K5  <chr> "K5-112-1122-1", "K5-112-1122-2", "K5-112-1122-3", "K5-112-1122…
kable(head(comp_parts_5)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_5), width_max = "12em", width_min = "8em")
ID_T30 ID_T31 ID_T33 ID_K5
30-217-2171-67 31-216-2161-15 33-217-2172-32 K5-112-1122-1
30-216-2162-3 31-216-2161-67 33-217-2172-82 K5-112-1122-2
30-216-2162-10 31-216-2161-125 33-216-2162-8 K5-112-1122-3
30-216-2162-11 31-216-2161-316 33-216-2162-11 K5-112-1122-4
30-216-2162-50 31-217-2172-99 33-216-2162-12 K5-112-1122-5
30-216-2162-53 31-217-2172-183 33-216-2162-14 K5-112-1122-6
# We do not select the column ID_T36 as our company does not produce part 36
comp_parts_6 <- Drop_Cols(comp_parts_6) %>% dplyr::select(-ID_T36)
glimpse(comp_parts_6)
## Rows: 512,354
## Columns: 4
## $ ID_T34 <chr> "34-217-2173-51", "34-217-2173-45", "34-218-2181-24", "34-217-2…
## $ ID_T35 <chr> "35-217-2173-9", "35-217-2173-64", "35-217-2173-107", "35-218-2…
## $ ID_T37 <chr> "37-218-2183-22", "37-218-2183-52", "37-218-2182-11", "37-217-2…
## $ ID_K6  <chr> "K6-113-1132-1", "K6-113-1132-2", "K6-113-1132-3", "K6-113-1132…
kable(head(comp_parts_6)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_6), width_max = "12em", width_min = "8em")
ID_T34 ID_T35 ID_T37 ID_K6
34-217-2173-51 35-217-2173-9 37-218-2183-22 K6-113-1132-1
34-217-2173-45 35-217-2173-64 37-218-2183-52 K6-113-1132-2
34-218-2181-24 35-217-2173-107 37-218-2182-11 K6-113-1132-3
34-217-2173-36 35-218-2181-200 37-217-2173-6 K6-113-1132-4
34-218-2181-65 35-217-2173-2 37-217-2173-7 K6-113-1132-5
34-218-2181-139 35-217-2173-152 37-217-2173-10 K6-113-1132-6
# We only select the columns of the parts produced by our company
comp_parts_7 <- Drop_Cols(comp_parts_7) %>% dplyr::select(ID_T34, ID_T35, ID_K7)
glimpse(comp_parts_7)
## Rows: 306,490
## Columns: 3
## $ ID_T34 <chr> "34-217-2173-2", "34-217-2173-24", "34-218-2181-42", "34-218-21…
## $ ID_T35 <chr> "35-218-2181-58", "35-217-2173-92", "35-218-2181-2", "35-218-21…
## $ ID_K7  <chr> "K7-114-1142-1", "K7-114-1142-2", "K7-114-1142-3", "K7-114-1142…
kable(head(comp_parts_7)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(comp_parts_7), width_max = "12em", width_min = "8em")
ID_T34 ID_T35 ID_K7
34-217-2173-2 35-218-2181-58 K7-114-1142-1
34-217-2173-24 35-217-2173-92 K7-114-1142-2
34-218-2181-42 35-218-2181-2 K7-114-1142-3
34-218-2182-13 35-218-2181-12 K7-114-1142-4
34-218-2182-45 35-218-2181-15 K7-114-1142-5
34-217-2173-15 35-218-2181-16 K7-114-1142-6

3.2.7 Tidying of parts data:

# We decided to rename the ID columns of each part dataset, because this will facilitate binding the rows later.


order_parts <- c("ID_Part", "Produktionsdatum", "Fehlerhaft", "Fehlerhaft_Datum", 
                 "Fehlerhaft_Fahrleistung")


P_30 <- Drop_Cols(P_30) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T30) %>% dplyr::select(order_parts) %>% relocate(order_parts)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(order_parts)` instead of `order_parts` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
glimpse(P_30)
## Rows: 193,798
## Columns: 5
## $ ID_Part                 <chr> "30-217-2171-317715", "30-217-2171-317698", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, NA, NA, NA, NA, 2012-12-17, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1181…
kable(head(P_30)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_30), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
30-217-2171-317715 2012-06-01 0 NA 0
30-217-2171-317698 2012-06-01 0 NA 0
30-217-2171-317776 2012-06-01 0 NA 0
30-217-2171-317706 2012-06-01 0 NA 0
30-217-2171-318036 2012-06-02 0 NA 0
30-217-2171-317926 2012-06-01 0 NA 0
P_31 <- Drop_Cols(P_31) %>% Prod_date() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T31) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_31)
## Rows: 258,679
## Columns: 5
## $ ID_Part                 <chr> "31-217-2172-424554", "31-217-2172-952233", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Fehlerhaft_Datum        <date> NA, 2012-09-20, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.000, 1142.860, 0.000, 0.000, 0.000, 0.000, 0…
kable(head(P_31)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_31), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
31-217-2172-424554 2012-06-01 0 NA 0.00
31-217-2172-952233 2012-06-01 1 2012-09-20 1142.86
31-217-2172-424488 2012-06-01 0 NA 0.00
31-217-2172-424684 2012-06-01 0 NA 0.00
31-217-2172-424465 2012-06-01 0 NA 0.00
31-217-2172-424686 2012-06-01 0 NA 0.00
P_32 <- Drop_Cols(P_32) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T32) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_32)
## Rows: 268,191
## Columns: 5
## $ ID_Part                 <chr> "32-217-2171-440142", "32-217-2171-440262", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, 2014-05-05, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 54927.68, 0.00, 0.00, 0.00, …
kable(head(P_32)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_32), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
32-217-2171-440142 2012-06-01 0 NA 0.00
32-217-2171-440262 2012-06-01 0 NA 0.00
32-217-2171-440377 2012-06-01 0 NA 0.00
32-217-2171-440470 2012-06-02 1 2014-05-05 54927.68
32-217-2171-440083 2012-06-01 0 NA 0.00
32-217-2171-440187 2012-06-01 0 NA 0.00
P_33 <- Drop_Cols(P_33) %>% Prod_date() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T33) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_33)
## Rows: 66,443
## Columns: 5
## $ ID_Part                 <chr> "33-217-2172-108136", "33-217-2172-108116", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, NA, NA, NA, 2013-11-01, NA, 2013-…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 38781.60, …
kable(head(P_33)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_33), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
33-217-2172-108136 2012-06-01 0 NA 0
33-217-2172-108116 2012-06-01 0 NA 0
33-217-2172-108084 2012-06-01 0 NA 0
33-217-2172-108091 2012-06-01 0 NA 0
33-217-2172-108094 2012-06-01 0 NA 0
33-217-2172-108097 2012-06-01 0 NA 0
P_34 <- Drop_Cols(P_34) %>% Prod_date() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T34) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_34)
## Rows: 67,222
## Columns: 5
## $ ID_Part                 <chr> "34-217-2173-109424", "34-217-2173-109442", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-02, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
kable(head(P_34)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_34), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
34-217-2173-109424 2012-06-01 0 NA 0
34-217-2173-109442 2012-06-01 0 NA 0
34-217-2173-109459 2012-06-02 0 NA 0
34-217-2173-109483 2012-06-02 0 NA 0
34-217-2173-109531 2012-06-02 0 NA 0
34-217-2173-109432 2012-06-01 0 NA 0
P_35$Fehlerhaft_Fahrleistung.y <-as.double(P_35$Fehlerhaft_Fahrleistung.y)
P_35 <- Drop_Cols(P_35) %>% Fix_unstruct_dat() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T35) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_35)
## Rows: 133,669
## Columns: 5
## $ ID_Part                 <chr> "35-217-2173-218806", "35-217-2173-218777", "3…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, NA, 2013-11-20, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0.00, 0.00, 0.00, 0.00, 37673.98, 0.00, 0.00, …
kable(head(P_35)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_35), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
35-217-2173-218806 2012-06-01 0 NA 0.00
35-217-2173-218777 2012-06-01 0 NA 0.00
35-217-2173-218800 2012-06-01 0 NA 0.00
35-217-2173-218975 2012-06-02 0 NA 0.00
35-217-2173-218818 2012-06-01 1 2013-11-20 37673.98
35-217-2173-218872 2012-06-01 0 NA 0.00
P_37 <- Drop_Cols(P_37) %>% Prod_date() %>% Filter_217_dates() %>% 
  rename(ID_Part = ID_T37) %>% dplyr::select(order_parts) %>% relocate(order_parts)
glimpse(P_37)
## Rows: 14,202
## Columns: 5
## $ ID_Part                 <chr> "37-217-2173-22140", "37-217-2173-22141", "37-…
## $ Produktionsdatum        <date> 2012-06-01, 2012-06-01, 2012-06-01, 2012-06-0…
## $ Fehlerhaft              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fehlerhaft_Datum        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Fehlerhaft_Fahrleistung <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
kable(head(P_37)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(P_37), width_max = "12em", width_min = "8em")
ID_Part Produktionsdatum Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
37-217-2173-22140 2012-06-01 0 NA 0
37-217-2173-22141 2012-06-01 0 NA 0
37-217-2173-22136 2012-06-01 0 NA 0
37-217-2173-22146 2012-06-01 0 NA 0
37-217-2173-22150 2012-06-01 0 NA 0
37-217-2173-22164 2012-06-02 0 NA 0
rm(order_parts)

4. Creation of the final dataset:

Now that we have successfully tidied, cleaned, and filtered the datasets, it is time to create the final dataset through a step-by-step joining procedure.

4.1 Joining the Registration dataset with the geodata:

We make use of a left join using the community variables as key to extract the geolocation data and join it with the registration data of each car.

registrations <- registrations %>%
  left_join(geodata, by = c('Gemeinden' = 'Gemeinde'))%>%
  relocate(IDNummer, Zulassung, Gemeinden, Postleitzahl, Laengengrad,Breitengrad)

rm(geodata)
kable(sample_n(registrations,6)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F, position = "left") %>% column_spec(1:ncol(registrations), width_max = "12em", width_min = "8em")
IDNummer Zulassung Gemeinden Postleitzahl Laengengrad Breitengrad
21-2-21-241800 2012-12-24 MICHENDORF 14552 13.028498 52.30963
22-2-21-58061 2016-12-02 KARLSDORF NEUTHARD 76689 8.545484 49.13615
22-2-22-171008 2014-10-02 LUSTADT 67363 8.274533 49.24452
11-1-11-1034907 2016-02-02 BAIENFURT 88255 9.651432 47.82698
21-2-21-123285 2011-02-21 BIELEFELD 33602 8.532120 52.02058
21-2-21-494007 2016-10-21 OBERHAUSEN 46045 6.852291 51.47032

4.2 Joining the Cars data with the registration data:

In this part, we first join the data of each car type with the data of the parts used to produce that car joining by the variable ID_Fahrzeug. We then created a single dataset containing the data of all different car types. Finally, we join this cars dataset with the registration dataset yielding the following result.

car_type11 <- car_type11  %>% left_join(car_parts_type11, "ID_Fahrzeug")
rm(car_parts_type11)


car_type12 <- car_type12  %>% left_join(car_parts_type12, "ID_Fahrzeug")
rm(car_parts_type12)


car_type21 <- car_type21  %>% left_join(car_parts_type21, "ID_Fahrzeug")
rm(car_parts_type21)


car_type22 <- car_type22  %>% left_join(car_parts_type22, "ID_Fahrzeug")
rm(car_parts_type22)

cars <- bind_rows(car_type11,car_type12,car_type21,car_type22)
rm(car_type11,car_type12,car_type21,car_type22)



cars <- cars %>% left_join(registrations, by = c('ID_Fahrzeug' = 'IDNummer')) %>% distinct()
rm(registrations)

kable(sample_n(cars,6)) %>% 
  kable_styling(bootstrap_options = "striped",full_width = F,font_size = 9, position = "left") %>% 
  column_spec(1:ncol(cars), width_max = "12em", width_min = "8em")
ID_Fahrzeug Herstellernummer Werksnummer ID_Karosserie Zulassung Gemeinden Postleitzahl Laengengrad Breitengrad
21-2-21-135783 2 21 K6-113-1132-135777 2011-04-27 ILVESHEIM 68549 8.567892 49.47232
11-1-11-544405 1 11 K4-114-1141-115751 2012-10-19 PIRMASENS 66953 7.608220 49.19741
22-2-22-21313 2 22 K7-114-1142-27017 2009-09-01 ORENHOFEN 54298 6.650700 49.89688
11-1-12-376491 1 12 K4-114-1141-157815 2012-12-19 WOLFHAGEN 34466 9.169659 51.32602
11-1-12-323907 1 12 K4-114-1141-25715 2012-06-01 KLEIN MECKELSEN 27419 9.451117 53.30395
11-1-11-234328 1 11 K4-112-1121-389796 2010-09-13 MUENSTER 48143 7.626804 51.96191

4.3 Joining the components data with cars data:

Before joining the components data with the cars data, we need to join the components with the data on parts used to build these components. However, when we do this, we will encounter a data frame containing a column for each part used to build the component. Instead, we want a single column for all parts. That is when we make use of the pivot_longer() function, which “lengthens” data, increasing the number of rows and decreasing the number of columns. In essence, it does the following:

Therefore, we create a special pivot function for the components data named Pivot_comp_parts:

Pivot_comp_parts <- function(df){
  df %>%
    pivot_longer(cols = starts_with("ID_T"), # Takes all the columns starting with ID_T
                 names_to = "Part_Type", # Classify them into a column named Part_Type
                 names_prefix = "ID_T", # Remove the prefix ID_T from Part_Type
                 values_to = "ID_Part") # Put the values of the selected colunms into ID_Part
}

After creating the Pivot_comp_parts function we can proceed to join the data of the components with the data on parts used to build these components and finally we can left join the cars dataset yuilding the following result:

comp_4 <- comp_4 %>% left_join(comp_parts_4, by = c('ID_Karosserie' = 'ID_K4')) %>% 
  Pivot_comp_parts()
rm(comp_parts_4)


comp_5 <- comp_5 %>% left_join(comp_parts_5, by = c('ID_Karosserie' = 'ID_K5')) %>%
  Pivot_comp_parts()
rm(comp_parts_5)


comp_6 <- comp_6 %>% left_join(comp_parts_6, by = c('ID_Karosserie' = 'ID_K6')) %>%
  Pivot_comp_parts()
rm(comp_parts_6)

comp_7 <- comp_7 %>% left_join(comp_parts_7, by = c('ID_Karosserie' = 'ID_K7')) %>%
  Pivot_comp_parts()
rm(comp_parts_7)


components <- bind_rows(comp_4,comp_5,comp_6,comp_7) %>% distinct()
rm(comp_4,comp_5, comp_6,comp_7)

components <- components %>% left_join(cars, "ID_Karosserie") %>% distinct()
rm(cars)

kable(sample_n(components,6))%>%
  kable_styling(bootstrap_options = "striped",full_width = F,font_size = 8, position = "left") %>% 
  column_spec(1:ncol(components), width_max = "12em", width_min = "8em")
ID_Karosserie Part_Type ID_Part ID_Fahrzeug Herstellernummer Werksnummer Zulassung Gemeinden Postleitzahl Laengengrad Breitengrad
K4-112-1121-613177 30 30-217-2171-220546 11-1-12-242946 1 12 2011-08-10 ALSHEIM 67577 8.341322 49.76793
K6-113-1132-236468 34 34-217-2173-111886 21-2-21-236465 2 21 2012-11-15 LUCKENWALDE 14943 13.173009 52.08925
K5-112-1122-215996 30 30-217-2171-376277 12-1-12-216009 1 12 2013-05-10 WAHRENHOLZ 29399 10.603296 52.61567
K4-114-1141-36505 30 30-216-2162-198939 11-1-11-496156 1 11 2012-06-15 BAD ESSEN 49152 8.343951 52.31962
K4-112-1121-744525 30 30-216-2161-267766 11-1-12-295161 1 12 2012-02-17 MAINZ 55116 8.271855 50.00196
K4-114-1141-933955 31 31-216-2161-1245910 11-1-11-1034010 1 11 2016-01-27 WAGHAEUSEL 68753 8.514287 49.25073

4.4 Joining of parts data with components:

Creating a parts dataset is relatively simple as we have previously tidied, cleaned, and filtered all the parts. By applying bind_rows() to all the parts data sets and then erasing duplicates with distinct(), we get the parts dataset containing all the relevant parts for our analysis. Finally, to get the final dataset, which will be used for our shiny app, we simply need to left join the componets data set to the parts dataset by the created column ID_Part

parts <- bind_rows(P_30,P_31,P_32,P_33,P_34,P_35,P_37) %>% distinct()
rm(P_30,P_31,P_32,P_33,P_34,P_35,P_37)


Final_dataset_group_20  <- parts %>% left_join(components, "ID_Part") %>% distinct()
rm(parts, components)

4.5 Final Dataset:

Finally, we will change the column names to their equivalent English translation for a matter of simplicity,reorder the dataset colunms, preview the head of the created dataset and save the entire dataset in the project folder as a CSV file.

Names_Dataset <- c("ID_Part","Production_Date_Part","Defective_Part", "Defective_Part_Date",
                   "Mileage_Part", "ID_Comp", "Part_Type", "ID_Car",
                   "OEM","Factory_Number","Registration",
                   "Community","Postal_Code", "Longitude", "Latitude")

names(Final_dataset_group_20) <- Names_Dataset
rm(Names_Dataset)

Final_dataset_group_20 <- Final_dataset_group_20 %>% relocate("ID_Part","Part_Type","Production_Date_Part","Defective_Part", 
         "Defective_Part_Date", "Mileage_Part", "ID_Comp", "ID_Car",
         "OEM","Factory_Number","Registration","Community","Postal_Code",
         "Longitude", "Latitude")

kable(sample_n(Final_dataset_group_20,6)) %>%
  kable_styling(bootstrap_options = "striped",full_width = F,font_size = 6, position = "left") %>% 
  column_spec(1:ncol(Final_dataset_group_20), width_max = "12em", width_min = "8em")
ID_Part Part_Type Production_Date_Part Defective_Part Defective_Part_Date Mileage_Part ID_Comp ID_Car OEM Factory_Number Registration Community Postal_Code Longitude Latitude
30-217-2171-406214 30 2013-05-29 0 NA 0 K4-114-1141-337247 11-1-12-448152 1 12 2013-09-10 EURASBURG 82547 11.403471 47.85593
33-217-2172-131537 33 2013-03-10 0 NA 0 K5-112-1122-221981 12-1-12-221886 1 12 2013-06-19 UETZE 31311 10.202538 52.46395
35-217-2173-331080 35 2014-03-31 0 NA 0 K7-113-1132-54809 22-2-22-164500 2 22 2014-07-16 MAINZ 55116 8.271855 50.00196
31-217-2172-623162 31 2014-01-31 0 NA 0 K5-113-1131-22543 12-1-12-267524 1 12 2014-05-08 KOELN 50667 6.957068 50.93811
34-217-2173-131875 34 2013-02-23 0 NA 0 K7-113-1132-12479 22-2-22-131093 2 22 2013-06-10 POESSNECK 07381 11.593211 50.69355
33-217-2172-144253 33 2013-08-10 0 NA 0 K5-112-1122-243386 12-1-12-243480 1 12 2013-11-22 BAD STAFFELSTEIN 96231 10.999853 50.10147
write.csv(Final_dataset_group_20,"Final_dataset_group_20.csv", row.names = FALSE)

5. Results:

In this section we will shortly display the different tabs of our shiny app that was built using the final dataset presented in the previous section.

5.1 Home tab:

Here we briefly introduce the problem to enable the users of the app to understand the purpose for which the app was built.

5.2 Damage Hotspot (Heatmap):

We also included community specific popups into the app that display the following information:

  • Community
  • Postcode
  • Longitude
  • Latitude
  • Number of Defective Cars
  • Number of Registered Cars
  • Share of Defective Cars (In the community)

5.3 Delivered Quantities:

In this tab, customers are able to view the quantities of parts supplied to the different plants of vehicle manufacturers OEM1 and OEM2. These include: 11-OEM:1, 12-OEM:1, 21-OEM:2 and 22-OEM:2.The production period can be broken down by the user into years, months, and weeks of the period in which non-compliant steel was used.

5.4 Useful life of body parts:

In this section, users are able to compare the locality, spread and skewness of the variable mileage,i.e, kilometers traveled before the part was found defective, for any two selectable communities.

5.5 Vehicle ID:

Based on the vehicle ID, users can find out whether their car contained a part that was found defective or not, or even if their car was affected by this incident, i.e., if their car was contains a part that was produced with non-compliant steel or not.

5.6 Final Dataset:

Finally, in this section, the users will find the dataset used to build the shiny app. In this way, we allow our customers to have full transparency over the process.

6. Evaluation of final results:

In this section we will shortly discuss some of the insights gained through the development of our web application.

6.1 Heatmap tab of the damage hotspots:

Throughout the heatmap of the damaged car’s hotspot in the second tab of our application, we were able to identify three specific federal states which were significantly affected by the use of non-compliant steel in our products:

  • North Rhine Westphalia: five cities in this state are among the top ten cities with the most cases of proven damaged cars. These include: Cologne, Dortmund, Bochum, Bonn, and Bielefeld. This insight leads us to the conclusion that this state is the primary hotspot for affected cars across the country.

  • Saxony: the state of Saxony takes second place as an affected car hotspot. Including two cities in the top ten, namely Dresden and Leipzig.

  • Bavaria: The southern state takes the third place, including cities like Munster and Augsburg.

6.2 Delivered quantities tab:

The bar plot in our second tab of the web application delivered part quantities, allowed us to identify that the plants of the car manufacturer OEM1, namely, plants 11 and 12 received significantly more parts than the plants of the car manufacturer OEM2, plants 21 and 22, over the entire period of production. In particular, plant 12 received the major quantities of parts, while plant 22 was supplied with the lowest quantity of parts during this period.

This leads us to the assumption that the car manufacturer OEM1 should have been delivered with a major quantity of defective parts than its counterpart OEM2. We will test this assumption at the car manufacturer level as well as plant level to see if the data matches the information given by the bar plot.

Final_dataset_Group_20 <-read_csv("Final_dataset_Group_20.csv")
#Calculate the total number of defective parts delivered throughout the entire period. (will be used laster)
Tot_Num_Defective_Cases <- nrow(Final_dataset_Group_20 %>%
  filter(Defective_Part == 1))

Manufacturer_Level<- Final_dataset_Group_20 %>% 
  filter(Defective_Part == 1)%>% #Filter for defective parts
  group_by(OEM) %>% #Group by car manufacturer
  count(OEM, sort =T)%>% # Count the number of cases by group
  rename('Number of defective part cases' = n) %>% # Rename n variable
  mutate('Share of defective parts delivered to the OEM' = 
           paste(round(((`Number of defective part cases`/Tot_Num_Defective_Cases)*100),2),"%", sep = "")) #Create a new variable which computes the share of defective parts by OEM

Plant_Level<- Final_dataset_Group_20 %>% 
  filter(Defective_Part == 1)%>% #Filter for defective parts
  group_by(Factory_Number) %>% #Group by plant
  count(Factory_Number, sort =T)%>% # Count the number of cases by group
  rename('Number of defective part cases' = n) %>% # Rename n variable
  rename('Plant Number' = Factory_Number) %>% # Rename Factory_Number variable
  mutate('Share of defective parts delivered to the plant' = 
           paste(round(((`Number of defective part cases`/Tot_Num_Defective_Cases)*100),2),"%", sep = "")) #Create a new variable which computes the share of defective parts by OEM
  
rm(Tot_Num_Defective_Cases) #Not necessary anymore

kable(Manufacturer_Level)%>%
  kable_styling(bootstrap_options = "striped",full_width = F, position = "center") %>%
  column_spec(1:ncol(Manufacturer_Level), width_max = "12em", width_min = "8em")
OEM Number of defective part cases Share of defective parts delivered to the OEM
1 78948 78.42%
2 21719 21.58%
kable(Plant_Level)%>%
  kable_styling(bootstrap_options = "striped",full_width = F, position = "center") %>%
  column_spec(1:ncol(Plant_Level), width_max = "12em", width_min = "8em")
Plant Number Number of defective part cases Share of defective parts delivered to the plant
12 40088 39.82%
11 38860 38.6%
21 15630 15.53%
22 6089 6.05%

The data on the tables created above confirm our expectations. Now, we are in a position where we can confirm that OEM1 was not only supplied with a major quantity of parts during this period but also possessed a major number of defective parts than its counterpart OEM2. At the plant level, plant 12 had the largest share of defective parts, followed by plant 11. Consequently, plants 21 and 22 received a smaller share of defective parts, which aligns with the insights taken from the bar plot in our web application.

This information would suggest that car types of the manufacturer OEM1 would be more prone to getting rust damages from the use of non-compliant steel than cars of the manufacturer OEM2.

6.3 Useful life of body parts:

The vehicle’s useful life, considered in the analysis, corresponds to the number of kilometres travelled before the defect was found. Moreover, their visualization and the comparisons, based on the registration community, are performed through the usage of boxplots. Indeed, boxplots display the distribution of a set of data through five key points. The minimum and maximum points are represented by the horizontal lines, while the colored box represents the interquartile range, and is limited by the first and third quartile, i.e. the values under which the 25% and 75% of points are found. The bold line in the middle of the box is the median, or second quartile. Finally, outliers are shown as points.
This type of plot is useful for comparing the shape of several distributions and assessing their symmetry, and allows to identify in a reliable and precise way the presence of outliers.

However, the set of data of the case study contains also negative values for the mileage of the defective vehicles. This is clearly incorrect, therefore, the boxplots are not reliable and cannot be exploited for further analysis.